Unifying regression testing with mutation testing
Software testing is the most commonly used methodology for validating the quality of software systems. Conceptually, testing is simple, but in practice, given the huge (practically infinite) space of inputs to test against, it requires solving a number of challenging problems, including evaluating and reusing tests efficiently and effectively as software evolves. While software testing research has seen much progress in recent years, many crucial bugs still evade state-of-the-art approaches, causing significant monetary losses and sometimes loss of life. My thesis is that a unified, bi-dimensional, change-driven methodology can form the basis of novel techniques and tools that make testing significantly more effective and efficient, allowing us to find more bugs at a reduced cost. We propose a novel unification of the following two dimensions of change: (1) real manual changes made by programmers, e.g., as commonly used to support more effective and efficient regression testing techniques; and (2) mechanically introduced changes to code or specifications, e.g., as originally conceived in mutation testing for evaluating the quality of test suites. We believe such unification can lay the foundation of a scalable and highly effective methodology for testing and maintaining real software systems. The primary contribution of my thesis is two-fold. First, it introduces new techniques that address central problems in both regression testing (e.g., test prioritization) and mutation testing (e.g., selective mutation testing). Second, it introduces a new methodology that uses the foundations of regression testing to speed up mutation testing, and conversely uses the foundations of mutation testing to help with the fault localization problem raised in regression testing. The central ideas are embodied in a suite of prototype tools, and rigorous experimental evaluation on a variety of real-world Java programs is used to validate the efficacy of the proposed techniques.
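To make the unification concrete, here is a minimal Python sketch of one idea in this direction: reusing per-test coverage data, a staple of regression testing, to prune the test runs needed during mutation testing. The Mutant class, the coverage map, and the test interface are hypothetical illustrations, not the thesis's actual tools.

    # Hypothetical sketch: a test that never executes the mutated statement
    # cannot kill the mutant, so skipping it is sound and saves test runs.
    from dataclasses import dataclass
    from typing import Callable, Dict, List, Set

    @dataclass(frozen=True)
    class Mutant:
        mutant_id: str
        mutated_line: int  # line of the original program that the mutant changes

    def reduced_mutation_run(
        mutants: List[Mutant],
        tests: Dict[str, Callable[[Mutant], bool]],  # test name -> "kills this mutant?"
        coverage: Dict[str, Set[int]],               # test name -> lines it executes
    ) -> Dict[str, List[str]]:
        killed: Dict[str, List[str]] = {m.mutant_id: [] for m in mutants}
        for m in mutants:
            for name, run_test in tests.items():
                # Regression-testing insight: only run tests covering the mutated line.
                if m.mutated_line in coverage[name] and run_test(m):
                    killed[m.mutant_id].append(name)
        return killed

Under this view, the same coverage data that regression test selection already maintains directly bounds the mutant-test pairs that ever need to run.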
Copiloting the Copilots: Fusing Large Language Models with Completion Engines for Automated Program Repair
During Automated Program Repair (APR), it can be challenging to synthesize
correct patches for real-world systems in general-purpose programming
languages. Recent Large Language Models (LLMs) have been shown to be helpful
"copilots" in assisting developers with various coding tasks, and have also
been directly applied to patch synthesis. However, most LLMs treat programs as
sequences of tokens, meaning that they are ignorant of the underlying semantic
constraints of the target programming language. As a result, many of the
generated patches are statically invalid, impeding the practicality of the
technique. Therefore, we propose Repilot, a framework to further copilot the AI
"copilots" (i.e., LLMs) by synthesizing more valid patches during the repair
process. Our key insight is that many LLMs produce outputs autoregressively
(i.e., token by token), resembling how humans write programs, and that this
process can be significantly boosted and guided by a Completion Engine. Repilot
synergistically synthesizes a candidate patch through the interaction between
an LLM and a Completion Engine, which 1) prunes away infeasible tokens
suggested by the LLM and 2) proactively completes the token based on the
suggestions provided by the Completion Engine. Our evaluation on a subset of
the widely used Defects4J 1.2 and 2.0 datasets shows that Repilot fixes 66 and
50 bugs, respectively, surpassing the best-performing baseline by 14 and 16
fixed bugs. More importantly, Repilot is capable of producing more valid and
correct patches than the base LLM when given the same generation budget.
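As a rough illustration of this interaction, the Python sketch below interleaves an LLM's ranked token suggestions with a completion engine's set of feasible continuations. The lm and engine interfaces are assumed stand-ins (e.g., an LLM decoder and a language-server-backed completion engine), not Repilot's actual API.

    def synthesize_patch(lm, engine, prefix, max_tokens=64):
        for _ in range(max_tokens):
            valid = engine.valid_continuations(prefix)  # tokens the language permits here
            if not valid:
                break  # dead end: no legal continuation from this prefix
            if len(valid) == 1:
                # Proactive completion: only one legal token remains, so append
                # it without spending an LLM query on it.
                prefix += next(iter(valid))
                continue
            # Pruning: walk the LLM's suggestions best-first and keep the
            # highest-probability token that the engine deems feasible.
            ranked = lm.next_token_distribution(prefix)  # [(token, prob), ...]
            token = next((t for t, _ in ranked if t in valid), None)
            if token is None:
                break  # all LLM suggestions are infeasible; abandon this candidate
            prefix += token
        return prefix

Every patch produced this way is statically valid by construction, which is the property the paper argues base LLMs lack.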
NeuRI: Diversifying DNN Generation via Inductive Rule Inference
Deep Learning (DL) is prevalently used in various industries to improve
decision-making and automate processes, driven by the ever-evolving DL
libraries and compilers. The correctness of DL systems is crucial for trust in
DL applications. As such, a recent wave of research has studied the automated
synthesis of test cases (i.e., DNN models and their inputs) for fuzzing DL
systems. However, existing model generators cover only a limited number of
operators, lacking the ability to pervasively model operator
constraints. To address this challenge, we propose NeuRI, a fully automated
approach for generating valid and diverse DL models composed of hundreds of
types of operators. NeuRI adopts a three-step process: (i) collecting valid and
invalid API traces from various sources; (ii) applying inductive program
synthesis over the traces to infer the constraints for constructing valid
models; and (iii) using hybrid model generation which incorporates both
symbolic and concrete operators. Our evaluation shows that NeuRI improves the
branch coverage of TensorFlow and PyTorch by 24% and 15%, respectively, over
state-of-the-art model-level fuzzers. NeuRI finds 100 new bugs for PyTorch and
TensorFlow in four months, with 81 already fixed or confirmed. Of these, 9 bugs
are labelled as high priority or as security vulnerabilities, constituting 10%
of all high-priority bugs reported during that period. Open-source developers
regard the error-inducing tests we reported as "high-quality" and "common in
practice".
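As a toy illustration of step (ii), the sketch below inductively keeps the candidate predicates that hold on every valid trace and reject at least one invalid trace, so they actually discriminate. The trace format and the candidate predicates are assumptions for illustration, not NeuRI's actual rule language.

    # Hypothetical traces: (argument dict, was the API call valid?)
    CANDIDATES = {
        "rank_matches_kernel": lambda a: len(a["input_shape"]) == len(a["kernel_size"]) + 2,
        "positive_dims":       lambda a: all(d > 0 for d in a["input_shape"]),
    }

    def infer_rules(traces):
        """Keep predicates consistent with all valid traces and refuted by
        at least one invalid trace."""
        rules = {}
        for name, pred in CANDIDATES.items():
            holds_on_valid = all(pred(args) for args, ok in traces if ok)
            rejects_invalid = any(not pred(args) for args, ok in traces if not ok)
            if holds_on_valid and rejects_invalid:
                rules[name] = pred
        return rules

    traces = [
        ({"input_shape": (1, 3, 32, 32), "kernel_size": (3, 3)}, True),
        ({"input_shape": (1, 3, 32),     "kernel_size": (3, 3)}, False),
    ]
    print(infer_rules(traces).keys())  # only the discriminating rank rule survives

Rules inferred this way can then seed the generator, so it emits operator calls that satisfy the constraints instead of crashing on input validation.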
Fuzzing Deep-Learning Libraries via Automated Relational API Inference
A growing body of research has been dedicated to DL model testing. However,
there is still limited work on testing DL libraries, which serve as the
foundations for building, training, and running DL models. Prior work on
fuzzing DL libraries can only generate tests for APIs that have been invoked
by documentation examples, developer tests, or DL models, leaving a large
number of APIs untested. In this paper, we propose DeepREL, the first approach
to automatically inferring relational APIs for more effective DL library
fuzzing. Our basic hypothesis is that for a DL library under test, there may
exist a number of APIs sharing similar input parameters and outputs; if so, we
can "borrow" test inputs from already-invoked APIs to test other relational
APIs. Furthermore, we formalize the notions of value equivalence and
status equivalence for relational APIs to serve as the oracle for effective bug
finding. We have implemented DeepREL as a fully automated end-to-end relational
API inference and fuzzing technique for DL libraries, which 1) automatically
infers potential API relations based on API syntactic or semantic information,
2) synthesizes concrete test programs for invoking relational APIs, 3)
validates the inferred relational APIs via representative test inputs, and
finally 4) performs fuzzing on the verified relational APIs to find potential
inconsistencies. Our evaluation on two of the most popular DL libraries,
PyTorch and TensorFlow, demonstrates that DeepREL can cover 157% more APIs than
the state-of-the-art FreeFuzz. To date, DeepREL has detected 162 bugs in total,
with 106 already confirmed by the developers as previously unknown bugs.
Surprisingly, DeepREL has detected 13.5% of the high-priority bugs for the
entire PyTorch issue-tracking system in a three-month period. Besides the 162
code bugs, we have also detected 14 documentation bugs (all confirmed).
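The two oracles can be illustrated with a small PyTorch sketch. The pairing below (torch.flatten versus reshape(-1)) is a hand-picked example of a relational pair, assumed for illustration rather than inferred by DeepREL itself.

    # Status equivalence: both calls succeed or both raise.
    # Value equivalence: when both succeed, the results must match.
    import torch

    def check_pair(api_a, api_b, test_input):
        def run(api):
            try:
                return ("ok", api(test_input))
            except Exception as e:
                return ("err", type(e).__name__)
        status_a, out_a = run(api_a)
        status_b, out_b = run(api_b)
        if status_a != status_b:
            return f"status inconsistency: {status_a} vs {status_b}"
        if status_a == "ok" and not torch.equal(out_a, out_b):
            return "value inconsistency"
        return "consistent"

    x = torch.arange(6).reshape(2, 3)
    print(check_pair(torch.flatten, lambda t: t.reshape(-1), x))  # consistent

Any inconsistency flagged this way points at a bug in at least one of the two APIs, without needing a hand-written specification for either.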
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
Program synthesis has been long studied with recent approaches focused on
directly using the power of Large Language Models (LLMs) to generate code
according to user intent expressed in natural language. Code evaluation
datasets, containing curated synthesis problems with input/output test cases,
are used to measure the performance of various LLMs on code synthesis. However,
the test cases in these datasets can be limited in both quantity and quality
for fully assessing the functional correctness of the generated code. This
limitation of existing benchmarks raises the following question: in the era of
LLMs, is the generated code really correct? To answer this, we propose EvalPlus
-- a code
synthesis benchmarking framework to rigorously evaluate the functional
correctness of LLM-synthesized code. In short, EvalPlus takes in the base
evaluation dataset and uses an automatic input generation step to produce and
diversify large amounts of new test inputs using both LLM-based and
mutation-based input generators to further validate the synthesized code. We
extend the popular HUMANEVAL benchmark and build HUMANEVAL+ with 81x
additional generated tests. Our extensive evaluation across 14 popular LLMs
demonstrates that HUMANEVAL+ catches a significant amount of previously
undetected wrong code synthesized by LLMs, reducing pass@k by 15.1% on
average! Moreover, we even found several incorrect ground-truth implementations
in HUMANEVAL. Our work not only indicates that prior popular code synthesis
evaluation results do not accurately reflect the true performance of LLMs for
code synthesis, but also opens up a new direction for improving programming
benchmarks through automated test input generation.
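To sketch the mutation-based half of the input generator, the Python below mutates seed inputs and differentially checks a candidate solution against the ground truth. The mutation operators and the toy "median" problem are illustrative assumptions, not EvalPlus internals.

    import random

    def mutate(x):
        """Type-aware mutation for simple inputs (ints and lists of ints)."""
        if isinstance(x, int):
            return x + random.choice([-1, 1, 10, -10])
        if isinstance(x, list):
            y = list(x)
            if y and random.random() < 0.5:
                y.pop(random.randrange(len(y)))  # drop a random element
            else:
                y.append(random.randint(-100, 100))  # grow the list
            return y
        return x

    def augment_and_check(candidate, reference, seeds, rounds=100):
        """Grow the test set by mutating seeds; flag the first divergence."""
        pool = list(seeds)
        for _ in range(rounds):
            x = mutate(random.choice(pool))
            pool.append(x)
            if candidate(x) != reference(x):
                return ("wrong", x)  # passed the seeds, fails on a new input
        return ("passed", None)

    reference = lambda xs: sorted(xs)[len(xs) // 2] if xs else None  # ground truth
    candidate = lambda xs: xs[len(xs) // 2] if xs else None          # forgets to sort
    print(augment_and_check(candidate, reference, seeds=[[1, 2, 3]]))

The candidate here passes the lone seed test yet is quickly exposed by mutated inputs, which is exactly the kind of silently wrong code the paper reports base benchmarks failing to catch.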